Physics 474 - Spring 2023
Homework 2 - Fitting Data, Parameter Estimation and Confidence Interval

Author: Djamil Lakhdar-Hamina


In this homework we will practice fitting a function with parameters to some data. In addition, we will place some emphisis also on determining the confidence interval for the fit parameters.

skills we will excercise:

  • reading in data
  • plotting data
  • writing user defined functions
  • fitting a function to data with 'curve_fit'
  • calculating χ2χ2 and χ2χ2 probaility
  • plotting residuals
  • dtermining confidence intervals using the Δχ2Δχ2
  • analyzing data and making observations

We will be investgating something called the Cosmic Microwave Background or CMB for short.

Quoted from Wikipedia:

"In Big Bang cosmology the cosmic microwave background (CMB, CMBR) is electromagnetic radiation that is a remnant from a primordial stage of the universe, also known as "relic radiation". The CMB is faint cosmic background radiation filling all space. It is an important source of data on the early universe because it is the oldest electromagnetic radiation in the universe, dating to the epoch of recombination when the first atoms were formed. With a standard optical telescope, the background space between stars and galaxies is almost completely dark. However, a sufficiently sensitive radio telescope detects a faint background glow that is almost uniform and is not associated with any star, galaxy, or other object. This glow is strongest in the microwave region of the radio spectrum. The accidental discovery of the CMB in 1965 by American radio astronomers Arno Penzias and Robert Wilson was the culmination of work initiated in the 1940s, and earned the discoverers the 1978 Nobel Prize in Physics."

We will be using measurements of CMB microwave intensity (W/m^2/sr/Hz) vs microwave frequency (GHz) compiled around 2000 in the source

Salvaterra and Burigana, 2000 arXiv:astro-ph/0206350

I have compiled the data into an easily read NumPy binary data file

"CMB_Intensity_Data.npz"

We will use this data to:

  • plot the intensity vs frequency
  • fit the data to a blackbody intensity vs frequency function to estimate the best-fit Temperature
  • plot the data along with the best fit
  • plot the residuals (fractional residuals in this particular case)
  • calcultae the best-fit χ2χ2
  • use the Δχ2Δχ2 as a function of temperature to estimate the confidence intervals for fit temperature

Background for the problem:

Black-Body Radiation

Every physical body (including the CMB surface) of temperature TT spontaneously and continuously emits electromagnetic radiation of radiance I(f;T)I(f;T) which describes the spectral emissive power per unit area, per unit solid angle, per unit frequency for particular radiation frequencies ff. The relationship is given by Planck's radiation law

I(f;T)=2hc2f3ehf/kT−1I(f;T)=2hc2f3ehf/kT−1

where hh is Plank's constant, kk is Boltzman's constant and cc is the speed of light. ___________________________________________________________________

You are given data for measurments of II vs frequency ff and are being asked to find the temperature TT that gives the best-fir between Plank's law and the data. You will then compare the best-fit to the data and estimate the confidence interval on the parameter TT. ___________________________________________________________________________

Part 1) (2 pts)

Read in the datafile 'CMB_Intensity_Data.npz' and print the "keys" in the file

e.g.

filename = 'CMB_Intensity_Data.npz'
Data = np.load(filename)
print(Data.files)

In [1]:
# Your code...
from math import exp
import numpy as np
import pandas as pd
import plotly.express as pex
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.optimize import curve_fit
import scipy.stats as st 


## define constants 

h=6.62607015e-34
k=1.380649e-23
c=299792458
best_cmb_temp=2.72548

## useful functions

planck_radiation_law=lambda f,T: (2*h/c**2)*(f**3/(np.exp(h*f/(k*T))-1))

def chi_squared(theory:np.array,data:np.array,sigma:np.array)->np.array:
    """
    This function calculates TOTAL chi-squared between Theory and Data using sigma 
    as errors.
    The 3 arrays must be of equal size.
    Note: This is NOT reduced chi-squared
    Usage:  
     inputs: theory = input hypothesis (or Theory)
             data = Data points
             sigma = uncertainty on data points
     output: if arrays are of equal size returns the TOTAL chi-squared
             if arrays are not of equal size returns -1.0
    """
    if np.size(theory)==np.size(data) and np.size(data)==np.size(sigma):
        return np.sum((theory-data)**2/sigma**2)
    else:
        print('error - arrays of unequal size')
        return -1.
    
def delta_chi_squared(best_fit_parameter:float,
                      theory,
                      data_x:np.array,
                      data_y:np.array,
                      data_err:np.array,
                      chi2:float,
                      sigma_range:int,
                      sigma:float,
                      n_bins:int)->np.array:
    
        """
    This function calculates Delta chi-squared between Theory and Data using sigma 
    as errors.
    The 3 arrays must be of equal size.
    Note: This is NOT reduced chi-squared
    Usage:  
     inputs: theory = input hypothesis (or Theory)
             data = Data points
             sigma = uncertainty on data points
     output: if arrays are of equal size returns the TOTAL chi-squared
             if arrays are not of equal size returns -1.0
    """
    
        binvals = np.linspace(best_fit_parameter - sigma_range*sigma, best_fit_temp + sigma_range*sigma,n_bins)
        delta_chi_square = np.zeros(n_bins)

        for i in range(n_bins):
            delta_chi_square[i] = chi_squared(theory(data_x, binvals[i]),data_y,data_err) - chi2

        return binvals,delta_chi_square

## import data
filename='CMB_Intensity_Data.npz'
cmb_intensity_data=np.load(f'data/{filename}')
cmb_intensity_data.files
Out[1]:
['description', 'frequency', 'intensity', 'error']

Print the 'description'

e.g.

print(Data['description'])

In [2]:
## print description of data
cmb_intensity_data['description']
Out[2]:
array(['-----------------------------------------------',
       'This file contains data for Cosmic Microwave Background (CMB)',
       'Data compiled from salvaterra and burigana, 2000 arXiv:astro-ph/0206350',
       'Intensity measurements versus microwave frequency',
       'The data is given as frequency (GHz)',
       'CMB Intensity (W/m^2/sr/Hz) and error on Intensity',
       'description = this text describing data',
       'frequency = Frequency of mesurement in GHz',
       'intensity = CMB Intensity (W/m^2/sr/Hz)',
       'error = estimated experimental uncertainty of intensity in same units',
       'NOTE:', 'Removed 1 data point f= 113.6 Ghz T=2.279 K',
       'and used delta_T = 0.025K in place of 0.01K for last 40 points',
       '--------------------------------------------------'], dtype='<U71')

Part 2) (3 pts)

Make a plot of the data with errorbars vs frequency [GHz]

  • use log scale for both x and y axes
  • set x-ticks at 0.1, 1, 10, 100, 1000 GHz
In [3]:
# define plots for errorbars vrs frequency 

fig = go.Figure(data=go.Scatter(
        x=cmb_intensity_data['frequency'],
        y=cmb_intensity_data['intensity'],
        error_y=dict(
            type='data', # value of error bar given in data coordinates
            array=cmb_intensity_data['error'],
            color='orange')
   ,mode='markers' ))


fig.update_layout(
    title="Intensity vrs. Frequency: Spectral Density of Blackbody",
    xaxis_title="Frequency (Ghz)",
    yaxis_title="Intensity (W/m^2/sr/Hz)",
    legend_title="error",
    font=dict(
        family="Courier New, monospace",
        size=10,
        color="RebeccaPurple"
    ),
      xaxis = dict(
        tickmode = 'array',
        tickvals = [.1,1,10,100,1000],
    ),
       yaxis = dict(
        tickmode = 'array',
        tickvals = [10**-22,10**-21,10**-20,10**-19,10**-18],
    )
)

fig.update_xaxes(type="log")
fig.update_yaxes(type="log")

fig.show()
1101001×10​−22​1×10​−21​1×10​−20​1×10​−19​1×10​−18​
Intensity vrs. Frequency: Spectral Density of BlackbodyFrequency (Ghz)Intensity (W/m^2/sr/Hz)
plotly-logomark

Part 3a) (3 pts)

fit the data above to Plank's law to get the best fit temperature

  • print the best fit temperature
  • print the estimated 1-σσ error returned by curve_fit

In [4]:
# predicting the cmb temperature via chi-square
T0 = 2.0
poptT, pcovT=curve_fit(f=planck_radiation_law,
                        xdata=cmb_intensity_data['frequency']*1e9, ## convert from Ghz to Hz
                        ydata=cmb_intensity_data['intensity'],
                        p0=T0,
                        sigma=cmb_intensity_data['error'],
                        absolute_sigma=True)

sigma=np.sqrt(np.diag(pcovT))[0]
      
print(f"Predicted temperature of the CMB : {poptT[0]}")
print(f"The 1-sigma error : {np.sqrt(np.diag(pcovT))[0]}")
Predicted temperature of the CMB : 2.727795133568652
The 1-sigma error : 0.003575932124145145

Part 3b) (2 pts)

Calculate and print the χ2χ2, reduced-χ2χ2, and χ2χ2 probability


In [5]:
## calculating chi square , dof, and probability 

chi2=chi_squared(planck_radiation_law(cmb_intensity_data['frequency']*1e9,*poptT),cmb_intensity_data['intensity'],cmb_intensity_data['error'])
dof = np.size(cmb_intensity_data['intensity'])-1  #degrees of freedom = data points - #parameters (here 1)
prob = st.chi2.sf(chi2,dof) #chi-squared probability from scipy function

print(f"chi-squared : {chi2}")
print(f"χ-mu : {chi2/dof}")
print(f"dof : {dof}")
print(f"prob : {prob}")
chi-squared : 117.93663675312395
χ-mu : 1.281919964707869
dof : 92
prob : 0.03551609242598948

Observations:

The best-fit temperature acheived from curve-fit gives a reduced chis-square of around 1.28, indicating that our theory plus the best fit parameter track the data quite well (since it is reasonably close to 1). The low chi-square probability means that around 3.5 % of the time we would get a higher chi-square simply by chance. Empirically, our best fit determines a temperature of around 2.73K which is in fact the approximate value quoted in the literature for the average temperature for the cosmic microwave background radiation. Chi-square minimization seems to be a good method for determining the temperature of the CMB. _______________________________________________________________________

Part 4) (5 pts)

Make a single figure with two subplots

  • top: data with errorbars and best fit curve vs frequency (same parameters as plot above)
  • bottom: fractional residuals (i.e residual/Radiance = (data-Fit)/fit) with errorbar (only log on x-scale)

In [6]:
# sort data for the fitted line
sorted_data=np.sort(cmb_intensity_data['frequency'])
fit_data=planck_radiation_law(cmb_intensity_data['frequency']*1e9,*poptT)
fractional_residuals=(cmb_intensity_data['intensity']-fit_data)/fit_data

fig = make_subplots(rows=2, cols=1)

## actual data in intesnity vrs frequency hz

fig.add_trace( 
   go.Scatter(
        x=cmb_intensity_data['frequency'],
        y=cmb_intensity_data['intensity'],
        error_y=dict(
            type='data', # value of error bar given in data coordinates
            array=cmb_intensity_data['error'],
            color='orange')
   ,
        mode='markers',name='Frequency' ),row=1,col=1),

## best fit line 

fig.add_trace( 
   go.Scatter(
        x=sorted_data,
        y=planck_radiation_law(sorted_data*1e9,*poptT),name='Best Fit: Planck Law'),row=1,col=1)

## error bar line

fig.add_trace( 
   go.Scatter(
        x=cmb_intensity_data['frequency'],
        y=fractional_residuals ,mode='markers',error_y=dict(type='data',array=cmb_intensity_data['error']/fit_data,color='orange'),name='Fractional Residuals'),row=2,col=1)

fig.update_layout(
    title="Intensity vrs. Frequency: Planck's Law and the Spectral Density of Blackbody",
    font=dict(
        family="Courier New, monospace",
        size=10,
        color="RebeccaPurple"
    )
    ,height=800,width=1000)

fig.update_xaxes(title_text="Frequency (GHz)", type="log",tickmode='array',tickvals=[.1,1,10,100,1000], row=1, col=1,showticklabels=True)
fig.update_yaxes(title_text="Intensity (W/m^2/sr/Hz)", tickmode='array',tickvals=[10**-22,10**-21,10**-20,10**-19,10**-18], type='log',row=1, col=1)
fig.update_xaxes(title_text="Frequency (GHz)", type="log",tickmode='array',tickvals=[.1,1,10,100,1000],  row=2, col=1)
fig.update_yaxes(title_text="Fractional Residuals (Intensity)",  row=2, col=1)
fig.show()
1101001×10​−22​1×10​−21​1×10​−20​1×10​−19​1×10​−18​110100−0.500.5
FrequencyBest Fit: Planck LawFractional ResidualsIntensity vrs. Frequency: Planck's Law and the Spectral Density of BlackbodyFrequency (GHz)Frequency (GHz)Intensity (W/m^2/sr/Hz)Fractional Residuals (Intensity)
plotly-logomark

Observations:

The top-image suggests that our theory+best-fit parameter leads to a very good fit for the data, surprisingly good. However, the bottom-image does clarify that in fact there is some discrepancy between fit and data. It seems that errors are more so distributd in the low-frequency range than in the high. However, there are many more points in the higher-frequency range, so that overwhelmingly residuals are neaer the zero-line. That being said, the best-fit line intersects nearly every single point, and the relatively low fractional residuals, especially in higher-frequency range, are further evidence that the best fit parameter of around 2.73K is close to the true parameter. Furthermore, it would seem that our one-parameter theory , Planck's law, is well confirmed when given the proper parameter.

Part 5) (5 pts)

Now calculate the Δχ2Δχ2 vs temperature for ±4σ±4σ (based on return of curve_fit) around the best-fit temperature TfitTfit. That is:

Δχ2(T)=χ2(T)−χ2(Tfit)Δχ2(T)=χ2(T)−χ2(Tfit)

make a plot

  • Δχ2(T)Δχ2(T) vs T
  • vertical line at TfitTfit
  • vertical dotted line at currrent world best estimate of T=2.72548KT=2.72548K
  • horizontal lines at Δχ2Δχ2 of 1, 4, 9

In [7]:
# sigma as determined by covariance matrix 
sigma=np.sqrt(np.diag(pcovT))[0]
best_fit_temp=poptT[0]
chi_square_best_fit=chi2

n_bins = 10000
tempvals = np.linspace(best_fit_temp - 4*sigma, best_fit_temp + 4*sigma,n_bins)
delta_chi_square = np.zeros(n_bins)

for i in range(n_bins):
    delta_chi_square[i] = chi_squared(planck_radiation_law(cmb_intensity_data['frequency']*1e9, tempvals[i]),cmb_intensity_data['intensity'],cmb_intensity_data['error']) - chi2
    
layout=dict(
    title="$\Delta\chi(T)^{2}   vrs.  Temperature  (Kelvin)$ ",
    xaxis_title="Temperature (K)",
    yaxis_title=r"$\Delta\chi(T)^{2}$",
    font=dict(
        family="Courier New, monospace",
        size=10,
        color="RebeccaPurple")
    ,height=800,width=1000 ,
    
     yaxis = dict(
        tickmode = 'array',
        tickvals = list(range(0,20,5)),
    ),
    xaxis = dict(
        tickmode = 'array',
        tickvals = [best_fit_temp +i*sigma for i in range(-4,5)],
        ticktext = ["$T_{fit}-4\sigma$",
                      "$T_{fit}-3\sigma$","$T_{fit}-2\sigma$",
                      "$T_{fit}-\sigma$",
                      "$T_{fit}$",
                     "$T_{fit}+1\sigma$",
                     "$T_{fit}+2\sigma$",
                     "$T_{fit}+3\sigma$",
                     "$T_{fit}+4\sigma$"]
    ))


temp_delta_chi_trace = dict(
    x = tempvals,
    y = delta_chi_square,
    mode = 'lines',
    type = 'scatter',
    line_shape='spline',
    connectgaps = True,
    name="$\Delta\chi(T)^{2}$",
    showlegend=True)

best_fit_temp_trace=dict(
    x=[best_fit_temp]*10,
    y=list(range(0,18,2)),
    mode = 'lines',
    type = 'scatter',
    line_shape='spline',
    line_color='purple',
    name='Best Fit Temperature (K)')


best_current_temp_trace=dict(
    x=[best_cmb_temp]*10,
    y=list(range(0,18,2)),
    mode = 'lines',
    type = 'scatter',
    line_shape='spline',
    line=dict(color='red', width=1, dash='dash'),
    name="Current Best Measurement (K)")

data=[temp_delta_chi_trace,best_fit_temp_trace,best_current_temp_trace]
fig = go.Figure(data = data,layout=layout)

for num in [1,4,9]:
    fig.add_hline(y=num,opacity=.4,line_color='red',annotation_text=f"$\Delta\chi(T)^{2}={num}$")

fig.show()
$T_{fit}-4\sigma$$T_{fit}-3\sigma$$T_{fit}-2\sigma$$T_{fit}-\sigma$$T_{fit}$$T_{fit}+1\sigma$$T_{fit}+2\sigma$$T_{fit}+3\sigma$$T_{fit}+4\sigma$051015
Δχ(T)2Δχ(T)2Best Fit Temperature (K)Current Best Measurement (K)$\Delta\chi(T)^{2} vrs. Temperature (Kelvin)$Temperature (K)$\Delta\chi(T)^{2}$$\Delta\chi(T)^2=1$$\Delta\chi(T)^2=4$$\Delta\chi(T)^2=9$
plotly-logomark

Summary and Conclusions:

(comment on)

  • what do the horizontal lines at Δχ2Δχ2 of 1, 4, 9 represent?
  • what are the approximate 68% and 95% confidence intevals on the fit temperature from this data?
  • what do these confidence intervals represent?
  • how do the Δχ2Δχ2 of 1, 4, 9 compare to the estimate of the 1-sigma error retuned by curve fit?

For a model with one-degree of freedom, the horizontal lines , in intersection with the Δχ2Δχ2 graph, represent lines which when projected on the x-axis, give confidence intervals. For the 68% confidence interval, at σσ, the fit temperature is between 2.726696526942884 and 2.7338516562404718. For the 95% interval, the fit temperature is between 2.7231189622940897 and 2.737429220889266. The 1-line intersects the Δχ2Δχ2 graph, and determies a region of x between two points of intersection. The projection of that line onto then x-axis represents that 68.3 % of the time the parameter, the CMB temperature, is within that region between −σ−σ and σσ. The 4-line determines a region which when projected x-wise represents the 95.4 % confidence interval. 95.4% of the time the parameter lies within the region between −2σ−2σ and 2σ2σ. Finally, the 9-line represents the determination of the region between −3σ−3σ and 3σ3σ where we have 99.73 % confidence that the true parameter is within that x-region i.e. 99.73 % of the time our parameter is found in that region. For a model of one-degree of freedom, the lines 1,4,9 intersect at multiples of 1−σ1−σ, determined from the covaraince matrix, at respectively 1,2,3 σσ. We see a complete overlap of the 1,4,9 line and multples of the 1−σ1−σ error. This overlap may be an artifact of the way the x-ticks were created, or have to do with the fine-grained nature of the binning i.e. the fact that the Δχ2Δχ2 was computed for over 10000 different values. The "true" parameter, corresponding to the state of the art measurement of the CMB temperature, is not even 1−σ1−σ from chi-square minimized value. Our procedure therefore determined a value with low error, or difference from the true value. It would seem that on the one hand not only have we determined a relatively close value to the CMB temperature but also that the Planck radiation law is a firm basis for making such a prediction.